Abstract

This paper is concerned with probability density estimation in high-dimensional settings. Simplified geometric arguments
and supporting examples point to a performance bound which limits algorithm performance to that of either (1)
nearest-neighbor or (2) single-kernel PDF estimators. A method of monitoring PDF estimation performance, as well as
recommendations for neural net and classification algorithm practitioners, are provided.


1  Introduction 

The  subject  of  this  paper  is  non-parametric  probability  density  function  (PDF)  estimation  in  high-dimensional  settings. 
As  such,  it  is  relevant  to  signal  processing,  estimation,  and  classification. 

We refer herein to the data dimension as P, the largest number of random variables to be considered at once, i.e.
in a joint distribution. When the form of the PDF is entirely unknown and must be estimated from training
data, the PDF values at quantized “grid cell” locations effectively become the parameters to estimate. The number of
parameters therefore rises exponentially with P. This rapid increase in complexity has been termed the curse
of dimensionality by Richard Bellman [1]. Indeed, it has been shown that, provided the PDF meets certain smoothness
assumptions, the amount of training data required for nonparametric estimators rises exponentially in P [2]. It is conjectured
by some researchers that the underlying structure of the data in most problems has a dimension rarely larger than
about 4 or 5 [3]. Thus, some kind of transformation or projection onto a lower-dimensional manifold is recommended.
The most obvious method of dimension reduction is simply to discard the least important features, a process called feature
selection, which is a well-studied problem in itself [4], [5]. The dilemma is that the true information content of a feature cannot
be measured without a good joint PDF estimate of all the features. On the other hand, a good PDF estimate cannot be
obtained at a high dimension. As features are added, increasing P, it is possible that algorithm performance may actually
get worse in spite of the information content of the added feature. Likewise, eliminating an information-bearing feature may
actually improve performance. Dimension reduction is a subject of ongoing research [6], [7], [3], [8].

That dimensionality is an overriding problem may at first seem to contradict the fact that PDF estimators have been employed
successfully at very high dimension. In this paper, we argue that in such applications, true PDF estimation is not happening.
What is really happening is explained below in the context of one of two possible primitive PDF estimators.

Projection of the PDF estimate onto one or two dimensions is an effective method of monitoring PDF accuracy. If the
low-dimensional projections are flawed, so must be the PDF estimate on $\mathbb{R}^P$. But accurate projections do not guarantee
good PDF estimates. It is easy to be “fooled” by the apparent size of the kernels. An example that illustrates this
phenomenon is the following [3]. Let the marginal distribution of each data dimension be uniform within the
interval $[-1, 1]$. The data is therefore contained in the hypercube $[-1, 1]^P$. Imagine a radius-1 hypersphere inscribed
inside the hypercube, touching the center of each “face”. As P grows, the fraction of data that falls inside the hypersphere
falls exponentially to zero. This is counter-intuitive because the inscribed hypersphere is very large when projected onto
any axis. We suggest an alternate approach that is not as easily fooled.
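
The inscribed-hypersphere example can be checked numerically. The following is a minimal sketch, assuming the data dimensions are independent, so that the data is uniform over the hypercube and the fraction of data inside the hypersphere equals the ratio of the two volumes. The function names are illustrative only.

```python
import math

def unit_ball_volume(P):
    """Volume of the radius-1 ball in P dimensions: pi^(P/2) / Gamma(P/2 + 1)."""
    return math.pi ** (P / 2) / math.gamma(P / 2 + 1)

def inscribed_fraction(P):
    """Fraction of the hypercube [-1, 1]^P (volume 2^P) covered by the inscribed unit ball.

    Assumption: the data is uniform over the hypercube, so this volume ratio
    is also the expected fraction of data samples inside the ball.
    """
    return unit_ball_volume(P) / 2.0 ** P

# The projection of the ball onto any single axis is the full interval [-1, 1],
# yet the enclosed fraction shrinks quickly as P grows.
for P in (1, 2, 3, 5, 10, 20):
    print(f"P = {P:2d}  fraction inside hypersphere = {inscribed_fraction(P):.3e}")
```

For P = 2 the fraction is pi/4 and for P = 3 it is pi/6, which matches the familiar circle-in-square and sphere-in-cube ratios.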

The  object  of  this  paper  is  not  to  offer  a  solution  to  the  dimensionality  problem,  but  to  offer  an  explanation  for  its 
existence  and  argue  that  it  is  not  solvable  when  attacked  as  a  high-dimensional  problem.  Solutions  are  offered  in  other 
papers  [9],  [10]. 

We  begin  the  paper  with  a  geometric  argument  to  expose  the  nature  of  the  curse  of  dimensionality.  This  argument  is 
concisely represented by a heuristic bound on performance. This provides a setting for the remainder of the paper. Next,
we  suggest  methods  of  assessing  PDF  accuracy  that  do  not  rely  on  classifier  performance.  We  end  the  paper  with  our 
concluding  remarks  and  recommendations. 

2  Heuristic  Performance  Bounds 

The  impetus  for  this  work  was  that  in  high  dimensional  problems,  it  was  found  from  experience  that: 

1.  Complex  classifiers  rarely  work  better  than  simple  classifiers  (Fisher’s  linear  discriminant,  quadratic  Bayesian,  or 
Nearest  Neighbor  classifiers). 

2.  Simple  classifiers  tend  to  improve  as  the  feature  set  dimension  increases. 

3.  Complex  classifiers  improve  with  increasing  dimension  initially,  but  then  performance  stops  improving,  or  drops  as 
dimension  increases.  As  dimension  continues  to  increase,  performance  sometimes  improves. 

The  arguments  to  be  presented  attempt  to  explain  this  behavior  using  simple  geometric  arguments.  By  no  means  do  we 
attempt  to  find  quantitative  results  that  can  be  verified  experimentally.  However,  because  the  arguments  are  basic,  it  should 
be  clear  to  the  readers  whether  or  not  the  assumptions  are  applicable  to  their  problems. 

1. This paper assumes that performance of algorithms is dependent only on PDF estimation. Strictly speaking, classification
performance depends only on accurate PDF estimation in the boundary regions between signal classes. Yet, it can
be argued that if we limit ourselves to speaking about these localized regions only, the same arguments hold.

2. We make arguments based on kernel-based PDF estimation. It can be said that our arguments do not apply to other
methods. However, consider that:

(a) the minimum probability of error is achieved in theory by the probabilistic Bayesian classifier, which requires
the PDF [11],

(b) kernel-based PDF estimators converge to the true PDF given smoothness of the PDF, enough data and enough
kernels [12],

(c) most neural networks actually are PDF estimators [13].

To  develop  a  language  for  the  arguments  to  be  presented,  we  consider  four  broad  classes  of  PDF  estimators: 

1. Variable Basis Function (VBF): The PDF is approximated as a sum of positive basis functions. It is assumed that
the approximation algorithm maximizes the approximation fit to the training data by determining the best set of
basis function locations and individual shapes and sizes.

2. Uniform Basis Function (UBF): The PDF is approximated as a sum of positive basis functions with the constraint
that all basis functions are identical except for location. This includes PNN and homoscedastic Gaussian
mixtures and applies in a broad sense to the case when all basis functions have the same volume, for example
strophoscedastic mixtures [14]. It is assumed that the approximation algorithm maximizes the approximation fit to
the training data by determining the best set of basis function locations and overall basis function size.

3. Single Basis Function (SBF): This is the special case of the first two methods where there is only one basis function.
It is assumed that the approximation algorithm maximizes the approximation fit to the training data by determining
the best single basis function (location, shape and size). Examples are the quadratic-Bayesian classifier or Fisher’s
linear discriminant.

4. K-Nearest Neighbor (KNN): The probability density is estimated from the volume of a ball (or other basis function)
that contains K samples from the training set.

In  this  paper  we  use  the  terms  kernel  and  basis  function  interchangeably. 
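
To make three of these estimator classes concrete, the following is a minimal sketch of the UBF, SBF and KNN estimators. It is not the implementation used in this paper. It assumes isotropic Gaussian kernels for the UBF, with one kernel per training sample, a diagonal-covariance Gaussian for the SBF, and a Euclidean ball for the KNN. The VBF case is omitted because it requires an iterative fit such as EM. All function names are illustrative only.

```python
import math

def log_gauss_iso(x, mean, sigma):
    """Log of an isotropic Gaussian kernel with standard deviation sigma."""
    P = len(x)
    sq = sum((a - b) ** 2 for a, b in zip(x, mean))
    return -0.5 * sq / sigma ** 2 - P * math.log(sigma) - 0.5 * P * math.log(2 * math.pi)

def ubf_log_pdf(x, centers, sigma):
    """UBF: equal-weight mixture of identical kernels that differ only in location."""
    logs = [log_gauss_iso(x, c, sigma) for c in centers]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(len(centers))

def sbf_log_pdf(x, data):
    """SBF: a single kernel fitted to all data (diagonal covariance, for brevity)."""
    N = len(data)
    total = 0.0
    for j in range(len(x)):
        col = [row[j] for row in data]
        mu = sum(col) / N
        var = sum((v - mu) ** 2 for v in col) / N
        total += -0.5 * (x[j] - mu) ** 2 / var - 0.5 * math.log(2 * math.pi * var)
    return total

def knn_log_pdf(x, data, K):
    """KNN: density = K / (N * volume of the ball that reaches the K-th nearest sample)."""
    P = len(x)
    r = sorted(math.dist(x, row) for row in data)[K - 1]
    log_ball = 0.5 * P * math.log(math.pi) - math.lgamma(P / 2 + 1) + P * math.log(r)
    return math.log(K) - math.log(len(data)) - log_ball
```

All three functions return log-densities, because the densities themselves underflow quickly as P grows.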

Consider the problem of estimating an arbitrary PDF from N data samples of a random variable $\mathbf{x}$ using a UBF
PDF estimate. The PDF is then used in an algorithm whose performance may be quantified. Consider $\mathbf{x}$ to be
infinite-dimensional (such data may be formed by adding random non-informative data to each sample of a finite-dimensional
data set). Let $\mathbf{x}_P$ be the first P dimensions of $\mathbf{x}$, with PDF $p_P(\mathbf{x}_P)$. Let $I(p_P)$ define the expected performance of the
algorithm when $p_P(\mathbf{x}_P)$ is used. Clearly,

$$I(p_P) \geq I(p_Q), \quad P > Q.$$

Let $\hat{p}_P(\mathbf{x}_P)$ be an estimate of $p_P(\mathbf{x}_P)$ derived from N data samples. Assume that

$$I(\hat{p}_P) \leq I(p_P).$$

Let the probability mass of $p_P(\mathbf{x}_P)$ be confined (mostly) to the P-dimensional hypercube defined by
$0 \leq x_i \leq D$, $i = 1, \ldots, P$, and let $d_0$ be the smallest dimension of local variations in $p_P(\mathbf{x}_P)$. The determination of $d_0$ is
illustrated in Figure 1. In the figure, $d_0$ is determined separately for each dimension by the smallest variation or “peak”
in a sectional slice at a fixed value of the other dimension(s). It is the largest kernel width in that dimension that a UBF
estimator of the PDF could have and still provide an accurate approximation to the true PDF. For simplicity, we assume it
is the same for each dimension, i.e. $d_{0,i} = d_0$. Let $Q(N, P, M, d)$ be the expected algorithm performance when the PDF
is approximated using a UBF mixture with M mixture components of diameter d, obtained from N data samples.

Figure 1: Determination of $d_0$ from data samples.

The Heuristic Performance Bounds state that

$$Q(N, P, M, d) \leq I(p_P)\, A(d)$$

$$Q(N, P, M, d) \leq I(p_P)\, V(M, d, P). \qquad (1)$$

We  now  describe  each  term. 

• We have described $I(p_P)$, the ideal expected performance, above.

• $A(d)$ is the Potential Accuracy for M infinite. This term represents the loss due only to the diameter of the mixture
components, d, exceeding $d_0$ (i.e., for M infinite). $A(d)$ is such that

$$\lim_{d \to d_0} A(d) = 1,$$

$$A(d_1) \leq A(d_2), \quad d_2 < d_1,$$

$$\lim_{d \to D} A(d) = \alpha > 0.$$

The last statement says that performance does not go to zero, even for very large kernel diameters. On the contrary,
performance approaches that of an SBF PDF estimator. An example of a function meeting these requirements is

$$A(d) = \alpha + (1 - \alpha)\, \frac{d_0/d}{1 + d_0/d}. \qquad (2)$$


• $V(M, d, P)$ represents the loss due only to volume coverage. Clearly the volume occupied by the approximation is
no greater than $M d^P$. Let the volume occupied by $p_P(\mathbf{x}_P)$ be denoted $V_0 = (\gamma D)^P$, where $0 < \gamma \leq 1$. The point at
which the volume of the approximation matches the volume occupied by $p_P(\mathbf{x}_P)$ is when $M d^P = V_0$. Define this
point to be $d^* = (V_0/M)^{1/P} = \gamma D M^{-1/P}$. Thus,

$$\lim_{d \to D} V(M, d, P) = 1,$$

$$V(M, d_1, P) \leq V(M, d_2, P), \quad d_2 > d_1,$$

$$\lim_{d \to 0} V(M, d, P) = \beta > 0.$$

The last statement says that performance does not go to zero, even for very small diameters. On the contrary,
performance approaches that of a KNN classifier. We have assumed that M is essentially fixed by the amount of available
data, N. The amount of data needed to estimate the parameters of each mixture component is approximately linear
in P (e.g. [15]). This weak dependence may be ignored. It should be expected that $V(M, d, P)$ drops very rapidly
to $\beta$ as d falls below $d^*$. An example of a function meeting these requirements is

$$V(M, d, P) = \beta + (1 - \beta)\, \frac{M d^P}{M d^P + (\gamma D)^P}.$$


We have mentioned already the work of Stone [2], which shows that accurate PDF estimation requires N to
increase exponentially in P. This corroborates the above analysis if we suppose that the amount of training data is
roughly proportional to M, say $N = kM$. In order to stay clear of the critical point of volume collapse, the kernels
must cover the data volume, i.e. $M d^P \geq V_0$, so we need
$N \geq k\,(\gamma D / d)^P$. Some researchers have derived similar expressions based on special cases [6], [3], [16].
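
As a worked illustration of this growth, the following sketch evaluates the critical diameter and the implied sample requirement. It assumes the relations stated in the text: the approximation volume $M d^P$ must be at least $V_0 = (\gamma D)^P$, and $N = kM$. The function names and the default values for D, gamma and k are illustrative assumptions, not values from the paper.

```python
def critical_diameter(M, P, D=1.0, gamma=1.0):
    """d* = gamma * D * M^(-1/P): the kernel diameter at which M * d^P equals V0."""
    return gamma * D * M ** (-1.0 / P)

def required_samples(P, d, D=1.0, gamma=1.0, k=1.0):
    """Smallest N = k * M such that M * d^P >= (gamma * D)^P.

    Assumption: k training samples are needed per kernel, as in N = k * M.
    """
    return k * (gamma * D / d) ** P

# With kernels one tenth of the data extent (d / D = 0.1), the requirement
# grows by a factor of 10 for every added dimension.
for P in (1, 2, 5, 10, 20):
    print(f"P = {P:2d}  d*(M=16) = {critical_diameter(16, P):.3f}  "
          f"N needed at d=0.1: {required_samples(P, d=0.1):.3g}")
```

The two functions are inverses of each other in the sense that `required_samples(P, critical_diameter(M, P))` returns M when k = 1.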

At this point, it is possible to make a very important observation: $A(d)$ falls as d increases toward D, but $V(M, d, P)$
falls rapidly as d decreases below $d^*$. But $d^* = (V_0/M)^{1/P} = \gamma D\,(1/M)^{1/P}$ approaches $\gamma D$ as P rises. This is illustrated in
Figure 2. So, it may be concluded that as P increases, performance becomes limited to either $\alpha$ or $\beta$, depending on the
value of d. We call the condition when $d < d^*$ and performance approaches $\beta$ the collapsed kernel (CK) effect. We call
the condition when performance approaches $\alpha$ the expanded kernel (EK) effect. We call the choice between the two the
fundamental tradeoff, the tradeoff associated with oversmoothing vs. undersmoothing. While many practitioners attempt to
find a compromise, we argue that at high dimensions, one can never escape suffering the brunt of one or the other.

Figure 2: Ideal performance can never be achieved for high P. Either Potential Accuracy or Volume Coverage is small.
For this plot, M = 16, $\beta = \alpha = 0.1$, $d_0 = 0.1$, $\gamma = 1$. The horizontal axis is d/D.
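
The tradeoff can be explored numerically. The sketch below assumes the example forms of the two loss terms, read here as $A(d) = \alpha + (1-\alpha)\,(d_0/d)/(1 + d_0/d)$ and $V(M,d,P) = \beta + (1-\beta)\,M d^P / (M d^P + (\gamma D)^P)$, and uses the Figure 2 parameter values as defaults. Since both bounds hold at once, performance relative to the ideal is at most min(A, V). The function `best_bound` is a hypothetical helper that searches a grid of diameters for the largest such value.

```python
def potential_accuracy(d, d0, alpha):
    """Example A(d): loss caused by kernels wider than d0 (assumed reading of Eq. 2)."""
    ratio = d0 / d
    return alpha + (1.0 - alpha) * ratio / (1.0 + ratio)

def volume_coverage(M, d, P, D, gamma, beta):
    """Example V(M, d, P): loss caused by M kernels of diameter d not covering V0."""
    covered = M * d ** P
    return beta + (1.0 - beta) * covered / (covered + (gamma * D) ** P)

def best_bound(P, M=16, d0=0.1, alpha=0.1, beta=0.1, D=1.0, gamma=1.0, steps=1000):
    """Largest value of min(A, V) over a grid of diameters 0 < d <= D."""
    best = 0.0
    for i in range(1, steps + 1):
        d = D * i / steps
        best = max(best, min(potential_accuracy(d, d0, alpha),
                             volume_coverage(M, d, P, D, gamma, beta)))
    return best

# Even with the best choice of d, the attainable fraction of ideal
# performance shrinks as P increases.
for P in (1, 2, 5, 10, 20):
    print(f"P = {P:2d}  best min(A, V) = {best_bound(P):.3f}")
```

Under these assumptions no diameter keeps both terms large once P is more than a few dimensions, which is the behavior described above.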

It is possible to relate CK and EK effects to “primitive” classifiers such as SBF and KNN. The performance of an SBF
classifier is $\alpha$ because the SBF kernel is matched to the volume of the data PDF. The performance of a KNN classifier is
$\beta$. A UBF or VBF classifier with very narrow basis functions ($d < d^*$) and a kernel located at each training sample is
equivalent to a KNN. It can be argued that UBF and VBF classifiers with more than one training sample per kernel still
have a similar behavior because of the narrowness of the kernels. From this standpoint, it is clear why some complex
Neural Networks work no better than simple SBF or KNN classifiers.

In classification problems, the values of $\alpha$ and $\beta$ depend on the cluster shapes and relative locations of clusters in the
high-dimensional space.

1. If the classes are well separated and unimodal, $\alpha$ and $\beta$ may both be high and either an SBF or Nearest-Neighbor
classifier will work well.

2. If they are unimodal but not well separated, $\alpha$ will be high but $\beta$ will be low, and an SBF classifier is recommended.

3. If their shapes are complex, yet they are well separated, $\beta$ will be high but $\alpha$ will be low, and a Nearest-Neighbor
classifier is recommended.

4. In practice, the above two conditions may exist simultaneously in different parts of the data space because real-world
data is not homogeneous. VBF classifiers may also exhibit a little of both conditions.


3  Supporting  Example 

Virtually all PDF estimation methods (including neural nets [13]) are consistent in the sense that they tend
to converge to the true PDF given enough training data and/or low enough dimension. There are numerous
methods of PDF estimation (for comparative studies and overviews see [4], [3], [17]). In this example, we utilize a
multivariate PDF estimation approach based on a heteroscedastic Gaussian mixture (GM) approximation. A widely
accepted technique for estimating the parameters of the GM model is the EM algorithm [11], [18]. The EM algorithm
suffers from numerical problems when there is insufficient data, leading some researchers to avoid it [17], to constrain the
covariances of the kernels to be identical [19], or to constrain them to uniform size with variable rotation [14]. Adding to the
covariance estimates, based on a Bayesian prior density argument, is the preferred method of dealing with the problem [20], [21]. This
involves simply adding a diagonal matrix, representing an independent measurement noise prior, to the kernel covariances
at each iteration. We have obtained excellent results with this method.
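
The covariance adjustment can be sketched as follows. This is a minimal illustration, assuming the diagonal matrix is a single prior variance times the identity. The name `prior_var` and its value are hypothetical, and the 2-by-2 determinant helper exists only to show the effect.

```python
def regularize_covariance(cov, prior_var):
    """Return cov + prior_var * I as a new matrix, leaving the input unchanged.

    Assumption: the measurement-noise prior is the same in every dimension.
    """
    return [[v + (prior_var if i == j else 0.0) for j, v in enumerate(row)]
            for i, row in enumerate(cov)]

def det2(m):
    """Determinant of a 2-by-2 matrix (illustration only)."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

# A kernel that has collapsed onto a line has a singular covariance matrix.
collapsed = [[1.0, 1.0], [1.0, 1.0]]
fixed = regularize_covariance(collapsed, prior_var=0.01)
print(det2(collapsed), det2(fixed))  # the regularized determinant is strictly positive
```

In an EM loop this step would be applied to every kernel covariance after each update, so that no kernel can reach zero volume.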

When the PDF is estimated by optimizing the likelihood function, for example with an EM algorithm, the total likelihood of the
training data is maximized. Therefore, kernels are caught between opposing forces. Kernels must become smaller to increase
their likelihood value, but larger to encompass more data. As P increases with N and M fixed, a given number of training
samples occupies an exponentially decreasing fraction of the data volume. Kernels tend to become associated with disjoint
“subsets” of training data and tightly enclose the data subsets in order to maximize the likelihood. In the limit, the data
“subsets” occupy subspaces of zero volume. However, depending on how the algorithm is initialized, these subspaces
have arbitrary orientations and therefore appear “wide” when projected onto a given pair of dimensions. As a visual example,
consider a flat disk-shaped kernel in a 3-dimensional volume enclosing 3 widely separated training data points in a plane.
When projected onto 2 dimensions, unless the two axes onto which it is projected are aligned just right, it will
appear wide. Thus, it makes sense that unless kernel width is constrained, the algorithm will tend toward CK effect, yet
the projection of the PDF approximation onto 2 dimensions will widen. The use of Bayesian priors, as mentioned above,
prevents the kernels from collapsing to zero volume, but the fractional volume still decreases exponentially.

Figures  3  through  5  illustrate  this  effect.  The  upper  portion  of  each  plot  is  a  scatter  diagram  of  the  data  samples 
available  in  the  training  set.  The  lower  portion  is  a  contour  plot  of  the  individual  kernels  at  a  fixed  level.  The  Gaussian 
Mixture  is  marginalized  (effectively  integrated  over  all  the  remaining  dimensions).  The  figures  represent  a  sequence  of 
increasing  P.  The  number  of  kernels  is  5  in  all  cases.  The  apparent  kernel  widths  gradually  widen. 

4  Monitoring  PDF  Accuracy 

We  have  argued  that  PDF  approximation  is  the  weak  link  in  high-dimensional  algorithm  performance.  But  how  is  it 
possible  to  monitor  PDF  approximation  accuracy  without  knowing  the  true  PDF?  In  this  section,  we  present  some  ideas. 

Figure 3: Gaussian Mixture approximation estimated at P = 2.

Figure 4: Gaussian Mixture approximation estimated at P = 14 and projected onto two dimensions.


Figure  5:  Gaussian  Mixture  approximation  estimated  at  P  =  44  and  projected  onto  two  dimensions. 


Figure  6:  Histograms  of  the  Log  Likelihood  for  Testing  data  and  Synthetic  data,  P  =  44.  Solid:  testing,  dashed:  synthetic. 


Figure  7:  Histograms  of  the  Log  Likelihood  for  Testing  data  and  Synthetic  data,  P  =  2.  Solid:  testing,  dashed:  synthetic. 


In one and two dimensions, it is possible to create plots such as Figure 3 to visually inspect the PDF approximation
by comparing it with actual data histograms. But in higher dimensions, this is no longer feasible. It is necessary to project the
multidimensional data and PDF approximation down to one or two dimensions to visualize it. Even then, such methods
depend on the choice of projection.

An idea we like is the following. Let there be three data sets: training data $\mathbf{x}_{Tr}$, testing data $\mathbf{x}_{Te}$, and synthetic data
$\mathbf{x}_{Sy}$. After obtaining a PDF estimate $\hat{p}(\mathbf{x})$ from $\mathbf{x}_{Tr}$, generate a population of synthetic data $\mathbf{x}_{Sy}$ based on $\hat{p}(\mathbf{x})$. Then
compare histograms of $\log \hat{p}(\mathbf{x}_{Te})$ and $\log \hat{p}(\mathbf{x}_{Sy})$. Comparing these histograms can reveal telltale problems. If $\hat{p}(\mathbf{x})$ is a
good approximation to $p(\mathbf{x})$, the two histograms will match. However, if CK effect occurs, $\log \hat{p}(\mathbf{x}_{Te})$ will exhibit a very
wide spread of values, including some very large negative values. If the testing data falls inside the narrow kernels, the
likelihood will be too high; if it falls outside, it will be much too low. If EK effect were to occur, the spread of values
would be narrow. Furthermore, EK effect would be quite visible in the 2-D projections.
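
The comparison of the two log-likelihood populations can be sketched as follows. This is not the Gaussian mixture and EM setup used for the figures in this paper. As a stand-in, it assumes a UBF estimator with one isotropic Gaussian kernel of fixed width `sigma` per training sample, and standard normal data. All names and parameter values are illustrative assumptions.

```python
import math
import random
import statistics

def parzen_log_pdf(x, centers, sigma):
    """Log-density of an equal-weight mixture of isotropic Gaussian kernels."""
    P = len(x)
    log_norm = -P * math.log(sigma) - 0.5 * P * math.log(2 * math.pi)
    logs = [-0.5 * sum((a - b) ** 2 for a, b in zip(x, c)) / sigma ** 2 for c in centers]
    m = max(logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in logs)) - math.log(len(centers)) + log_norm

def sample_parzen(centers, sigma, n, rng):
    """Draw synthetic data from the estimate: pick a kernel, then add kernel noise."""
    return [[v + sigma * rng.gauss(0.0, 1.0) for v in rng.choice(centers)] for _ in range(n)]

def loglik_populations(P, n=100, sigma=0.3, seed=0):
    """Return (test, synthetic) log-likelihood samples under the trained estimate."""
    rng = random.Random(seed)
    def draw():
        return [rng.gauss(0.0, 1.0) for _ in range(P)]
    train = [draw() for _ in range(n)]
    test = [draw() for _ in range(n)]
    synth = sample_parzen(train, sigma, n, rng)
    ll_test = [parzen_log_pdf(x, train, sigma) for x in test]
    ll_syn = [parzen_log_pdf(x, train, sigma) for x in synth]
    return ll_test, ll_syn

for P in (2, 20):
    ll_test, ll_syn = loglik_populations(P)
    print(f"P = {P:2d}  test: mean {statistics.mean(ll_test):8.2f} sd {statistics.stdev(ll_test):6.2f}"
          f"  synthetic: mean {statistics.mean(ll_syn):8.2f} sd {statistics.stdev(ll_syn):6.2f}")
```

In practice the two lists would be plotted as histograms. With this fixed-width estimator, a large gap between the two populations at high P is the symptom to look for, since the kernels are then too narrow to cover the region where the test data falls.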

Consider  Figure  6,  created  for  a  PDF  estimate  of  dimension  44  (see  Figure  5).  This  shows  the  testing  data  has  both 
higher  and  lower  values  of  likelihood  than  the  synthetic  data,  indicating  CK  effect. 

The experiment was repeated for the case of Figure 3 (P = 2) and the result is plotted in Figure 7. The histograms
are much better matched, indicating a better PDF estimate. The method compresses an enormous amount of information
into  a  single  plot  and  may  provide  the  basis  of  non-subjective  measures  of  fit  such  as  total  log-likelihood  of  testing  data. 
This  method  is  not  foolproof  since  the  matching  of  the  log-likelihood  distributions  is  not  a  guarantee  of  a  correct  PDF 
estimate.  Combined  with  the  2-D  projection  method,  however,  it  is  a  powerful  test. 

5  Conclusions  and  Recommendations


Starting from the premise that dimensionality plays a central role in algorithm performance, an approximate “conceptual”
equation of dimensionality has been presented. The behavior of classifiers at higher dimension has been explained in
the context of this relationship. Specifically, classifiers show a tendency to perform similarly to one of two “primitive”
classifiers: a nearest neighbor classifier or a single-kernel classifier, or somewhere between the two.

We highly recommend doing three things whenever a new PDF estimator is evaluated: (1) plotting marginalized
PDF intensity and data scatter diagrams in 2 dimensions, or data histograms together with marginalized PDF curves, (2)
plotting KNN and SBF performance on the same graph as any new algorithm performance, and (3) plotting histograms of
log-likelihood for the two data populations: testing data and data synthesized from the trained PDF model. This will
(a) locate PDF approximation errors, (b) put performance in the perspective of “primitive” methods, and (c) determine
whether CK or EK effects are happening.